Scalable and Efficient Construction of Suffix Array with MapReduce and In-Memory Data Store System

Authors

Hsiang-Huang Wu

Chien-Min Wang

Hsuan-Chi Kuo

Wei-Chun Chung

Jan-Ming Ho

Abstract

Suffix Array (SA) is a fundamental data structure in many pattern matching applications, including data compression, plagiarism detection, and sequence alignment. However, as data volumes grow rapidly, SA construction no longer fits current large-scale data processing frameworks well, because the number of suffixes proliferates during construction. As a result, improving performance simply by adding resources to these frameworks becomes less cost-effective and suffers severely diminishing returns. The question is whether a radical shift in perspective can make SA construction more scalable and efficient as data continue to accumulate. Taking TeraSort [1] as our baseline, we first demonstrate the fragile scalability of TeraSort and investigate its causes through experiments on the sequence alignment of a grouper (i.e., the SA construction used in bioinformatics). We then propose a scheme that integrates a distributed key-value store system into MapReduce to leverage in-memory queries about suffixes. Rather than communicating the suffixes themselves, MapReduce communicates only their indexes, which leaves capacity for more data. The scheme significantly reduces the disk space required for constructing SA and makes better use of memory, which in turn radically improves scalability. We also examine the memory efficiency of our scheme and show that it outperforms TeraSort. Finally, our scheme can complete paired-end sequencing and alignment with two input files without any degradation in scalability, and can accommodate nearly 6.7 TB of suffixes in a small cluster of 16 nodes connected by Gigabit Ethernet, without any compression.
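The idea of shuffling suffix indexes instead of suffixes can be illustrated with a minimal single-process sketch. This is not the authors' implementation: a plain Python dict stands in for the distributed in-memory key-value store, the map, shuffle, and reduce steps are ordinary functions, and all names (`build_store`, `map_phase`, `shuffle`, `reduce_phase`, `suffix_array`, `prefix_len`) are hypothetical. The `$` terminator appended to each read is also an assumption made for the example.

```python
from collections import defaultdict


def build_store(reads):
    """Stand-in for the in-memory key-value store: read id -> terminated read."""
    return {rid: text + "$" for rid, text in enumerate(reads)}


def map_phase(store, prefix_len=1):
    """Emit (partition key, (read id, offset)) pairs.

    Only the short partition key and the index travel through the shuffle;
    the full suffix text is never emitted.
    """
    for rid, text in store.items():
        for off in range(len(text)):
            yield text[off:off + prefix_len], (rid, off)


def shuffle(pairs):
    """Group indexes by partition key, as the framework's shuffle would."""
    buckets = defaultdict(list)
    for key, idx in pairs:
        buckets[key].append(idx)
    return buckets


def reduce_phase(store, buckets):
    """Sort each bucket by looking the suffix up in the store on demand."""
    result = []
    for key in sorted(buckets):
        idxs = buckets[key]
        # Ties between identical suffixes are broken by (read id, offset).
        idxs.sort(key=lambda idx: (store[idx[0]][idx[1]:], idx))
        result.extend(idxs)
    return result


def suffix_array(reads, prefix_len=1):
    """Return all (read id, offset) indexes in lexicographic suffix order."""
    store = build_store(reads)
    return reduce_phase(store, shuffle(map_phase(store, prefix_len)))
```

For example, `suffix_array(["banana"])` yields the offsets of `banana$` in sorted suffix order. In this sketch each shuffled record has a fixed small size regardless of read length, whereas shuffling the suffixes themselves would move a number of characters that grows quadratically with read length.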


Similar resources

Efficient and Scalable Indexing Techniques for Biological Sequence Data

We investigate indexing techniques for sequence data, crucial in a wide variety of applications, where efficient, scalable, and versatile search algorithms are required. Recent research has focused on suffix trees (ST) and suffix arrays (SA) as desirable index representations. Existing solutions for very long sequences however provide either efficient index construction or efficient search, but...


Scalable Construction of Text Indexes

The suffix array is the key to efficient solutions for myriads of string processing problems in different applications domains, like data compression, data mining, or Bioinformatics. With the rapid growth of available data, suffix array construction algorithms had to be adapted to advanced computational models such as external memory and distributed computing. In this article, we present five s...


Scalable Parallel Suffix Array Construction

Suffix arrays are a simple and powerful data structure for text processing that can be used for full text indexes, data compression, and many other applications in particular in bioinformatics. We describe the first implementation and experimental evaluation of a scalable parallel algorithm for suffix array construction. The implementation works on distributed memory computers using MPI, Experi...


A Space-Efficient Construction of the Burrows-Wheeler Transform for Genomic Data

Algorithms for exact string matching have substantial application in computational biology. Time-efficient data structures which support a variety of exact string matching queries, such as the suffix tree and the suffix array, have been applied to such problems. As sequence databases grow, more space-efficient approaches to exact matching are becoming more important. One such data structure, th...


Optimal Time and Space Construction of Suffix Arrays and LCP Arrays for Integer Alphabets

Suffix arrays and LCP arrays are among the most fundamental data structures, widely used for various kinds of string processing. Many problems can be solved efficiently by using suffix arrays, or a pair of suffix arrays and LCP arrays. In this paper, we consider two problems for a string of length N, the characters of which are represented as integers in [1, ..., σ] for 1 ≤ σ ≤ N; the stri...



Journal:

CoRR

Volume abs/1705.04789

Publication date: 2017
